Equi-depth Histogram Construction for Big Data with Quality Guarantees

Authors

  • Burak Yildiz
  • Tolga Büyüktanir
  • Fatih Emekçi
Abstract

The amount of data generated and stored in cloud systems has been increasing exponentially. Examples include user-generated data, machine-generated data, and data crawled from the Internet. Several frameworks, such as Apache Hadoop, HDFS and several NoSQL frameworks, have proven efficient at storing and processing petabyte-scale data. These systems are widely used in industry and are therefore the subject of considerable research. To be practical, proposed data processing techniques should be compatible with these frameworks. One of the key data operations is deriving equi-depth histograms, which are crucial to understanding the statistical properties of the underlying data and have many applications, including query optimization. In this paper, we focus on approximate equi-depth histogram construction for big data and propose a novel merge-based histogram construction method, together with a histogram processing framework that constructs an equi-depth histogram for a given time interval. The proposed method constructs approximate equi-depth histograms by merging exact equi-depth histograms of partitioned data, while guaranteeing a maximum error bound on the number of items in a bucket (bucket size) as well as on any range of the histogram.
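
The merge idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the merge procedure or its error bound, so the functions `exact_equi_depth` and `merge_histograms`, the `(upper boundary, count)` bucket representation, and the sweep-based merge rule are all assumptions made for illustration. Each partition gets an exact equi-depth histogram, and the merge pools the partition buckets and closes an output bucket whenever the running count reaches the next multiple of total/num_buckets.

```python
from typing import Iterable, List, Tuple

# Hypothetical representation: a bucket is (upper boundary, item count).
Bucket = Tuple[float, int]


def exact_equi_depth(values: Iterable[float], num_buckets: int) -> List[Bucket]:
    """Exact equi-depth histogram: sort, then cut into near-equal-size buckets."""
    data = sorted(values)
    n = len(data)
    buckets: List[Bucket] = []
    start = 0
    for i in range(1, num_buckets + 1):
        end = (i * n) // num_buckets  # index just past the i-th bucket
        if end > start:
            buckets.append((data[end - 1], end - start))
            start = end
    return buckets


def merge_histograms(histograms: List[List[Bucket]], num_buckets: int) -> List[Bucket]:
    """Approximate merge (an assumed rule, not taken from the paper)."""
    # Pool every partition bucket and order the pool by upper boundary.
    pooled = sorted(b for h in histograms for b in h)
    total = sum(count for _, count in pooled)
    merged: List[Bucket] = []
    seen = 0      # items swept so far
    running = 0   # items in the output bucket currently being filled
    next_cut = 1  # index of the next target quantile
    for boundary, count in pooled:
        seen += count
        running += count
        # Close the output bucket once the sweep reaches the next cut point.
        if seen * num_buckets >= next_cut * total:
            merged.append((boundary, running))
            running = 0
            while seen * num_buckets >= next_cut * total:
                next_cut += 1
    return merged


# Two partitions, each summarized by 10 exact buckets, merged into 4 buckets.
h1 = exact_equi_depth(range(0, 100), 10)
h2 = exact_equi_depth(range(100, 200), 10)
print(merge_histograms([h1, h2], 4))
# [(49, 50), (99, 50), (149, 50), (199, 50)]
```

In this sketch the merged counts are only approximate when partitions overlap in value range, because a partition bucket is assigned wholly to the output bucket that contains its upper boundary. Using more buckets per partition than in the merged result makes each such misassignment smaller.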

Similar resources

Fast and Space-Efficient Computation of Equi-Depth Histograms for Data Streams

Equi-depth histograms represent a fundamental synopsis widely used in both database and data stream applications, as they provide the cornerstone of many techniques such as query optimization, approximate query answering, distribution fitting, and parallel database partitioning. Equi-depth histograms try to partition a sequence of data in a way that every part has the same number of data items....

Rectangular Attribute Cardinality Map: A New Histogram-like Technique for Query Optimization

Current database systems utilize histograms to approximate frequency distributions of attribute values of relations. These are used to efficiently estimate query result sizes and access plan costs. Even though they have been in use for nearly two decades, there have been no significant mathematical techniques (other than those used in statistics for traditional histogram approximations) to study...

A Learning Framework for Self-Tuning Histograms

In this paper, we consider the problem of estimating self-tuning histograms using query workloads. To this end, we propose a general learning theoretic formulation. Specifically, we use query feedback from a workload as training data to estimate a histogram with a small memory footprint that minimizes the expected error on future queries. Our formulation provides a framework in which different ...

Sketch Techniques for Approximate Query Processing

Sketch techniques have undergone extensive development within the past few years. They are especially appropriate for the data streaming scenario, in which large quantities of data flow by and the sketch summary must continually be updated quickly and compactly. Sketches, as presented here, are designed so that the update caused by each new piece of data is largely independent of the curren...

The Efficiency of Histogram-like Techniques for Database Query Optimization

One of the most difficult tasks in modern day database management systems is information retrieval. Basically, this task involves a user query, written in a high-level language such as the Structured Query Language, and some internal operations, which are transparent to the user. The internal operations are carried out through very complex modules that decompose, optimize and execute the differ...

Journal title:
  • CoRR

Volume abs/1606.05633  Issue

Pages  -

Publication date 2016